Skip to content

fix(mothership): key resumes by orchestration id#3771

Merged
icecrasher321 merged 2 commits intostagingfrom
fix/mothership-id-key
Mar 25, 2026
Merged

fix(mothership): key resumes by orchestration id#3771
icecrasher321 merged 2 commits intostagingfrom
fix/mothership-id-key

Conversation

@icecrasher321
Copy link
Collaborator

Summary

Use orchestration associated id instead of run id to prevent duplicate resume callers.

Type of Change

  • Bug fix

Testing

Tested manually

Checklist

  • Code follows project style guidelines
  • Self-reviewed my changes
  • Tests added/updated and passing
  • No new warnings introduced
  • I confirm that I have read and agree to the terms outlined in the Contributor License Agreement (CLA)

@cursor
Copy link

cursor bot commented Mar 25, 2026

PR Summary

Medium Risk
Changes the worker identity used to claim/release async tool-call continuations; incorrect IDs could still cause claim contention or stalled resumes in multi-worker scenarios.

Overview
Fixes async continuation resume de-duping by introducing a per-orchestration continuationWorkerId (instead of deriving the resume worker ID from runId/messageId).

Adds workerId to resume-related logs and ensures async tool-call claim/contension handling (claimCompletedAsyncToolCall/releaseCompletedAsyncToolClaim) consistently uses this new worker identity.

Written by Cursor Bugbot for commit bce6fde. Configure here.

@vercel
Copy link

vercel bot commented Mar 25, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment
Project Deployment Actions Updated (UTC)
docs Skipped Skipped Mar 25, 2026 9:42pm

Request Review

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Mar 25, 2026

Greptile Summary

This PR fixes a bug in the async tool continuation claim system where multiple concurrent requests sharing the same runId would derive an identical resumeWorkerId, allowing one worker to incorrectly treat another worker's claims as its own (the alreadyClaimedByWorker check on line 303 would fire spuriously).

The fix introduces continuationWorkerId — a stable-per-invocation, globally-unique identifier generated once with crypto.randomUUID() at orchestration start — and passes it through as the resumeWorkerId for the entire claim lifecycle. The previous fallback chain (continuation.runId || context.runId || context.messageId) is replaced outright. Additionally, workerId is added to all log payloads in the async continuation loop, which improves distributed traceability.

Key changes:

  • continuationWorkerId is generated once (line 129) and held constant for the entire orchestration call, including across multiple outer-loop iterations (multiple continuations), which is correct since claims are scoped to per-tool-call IDs rather than checkpoints
  • workerId logging is added at 6 log sites in the resume loop for better observability

Confidence Score: 5/5

  • Safe to merge — targeted one-file fix with no regressions and improved log observability.
  • The change is minimal and directly addresses the root cause: concurrent orchestrations sharing a runId can no longer accidentally treat each other's claims as their own. The UUID is stable per invocation (correct for retry semantics within the same call) and unique across invocations (correct for isolation). All log sites gain the workerId field for traceability. No existing logic is removed or re-ordered; the only behavioral change is the identity used for claim ownership.
  • No files require special attention.

Important Files Changed

Filename Overview
apps/sim/lib/copilot/orchestrator/index.ts Replaces `continuation.runId

Sequence Diagram

sequenceDiagram
    participant R1 as Request A (runId: X)
    participant R2 as Request B (runId: X)
    participant DB as Async Tool DB

    Note over R1,R2: Before fix — shared runId means shared workerID
    R1->>DB: claimCompletedAsyncToolCall(toolCallId, "runId:X")
    DB-->>R1: claimed ✓
    R2->>DB: claimCompletedAsyncToolCall(toolCallId, "runId:X")
    DB-->>R2: already claimed — but claimedBy === resumeWorkerId ⚠️
    Note over R2: alreadyClaimedByWorker=true (wrong!)

    Note over R1,R2: After fix — each invocation gets unique UUID
    R1->>DB: claimCompletedAsyncToolCall(toolCallId, "sim-resume:uuid-1")
    DB-->>R1: claimed ✓
    R2->>DB: claimCompletedAsyncToolCall(toolCallId, "sim-resume:uuid-2")
    DB-->>R2: claim contention → retry / backoff
    Note over R2: missingToolCallIds path — correct behaviour
Loading

Reviews (1): Last reviewed commit: "Merge branch 'staging' into fix/mothersh..." | Re-trigger Greptile

@icecrasher321 icecrasher321 merged commit 438defc into staging Mar 25, 2026
12 checks passed
@waleedlatif1 waleedlatif1 deleted the fix/mothership-id-key branch March 25, 2026 23:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant